Intel GFX CI


What services do we provide, our roadmaps, and lessons learnt! 


Martin Peres 
Sept 20th, 2017


Intel Open Source Technology Center


Agenda 


● Introduction: Why CI, and objectives

● State of Intel GFX CI, and future plans


Why do we need Continuous Integration (CI)?


● CI puts the cost of integration on the person making the change:
  ○ It scales better with the number of developers!
  ○ Less time is spent fixing bugs after merging
  ○ Developers get a better global understanding of the project

● CI keeps the integration tree in working condition at all times


Objectives of CI


● Provide an accurate view of the state of the HW/SW
● Results should be:
  ○ Transparent: they should contain the full HW and SW configuration
  ○ Fast: basic results in under 30 minutes, complete ones in half a day
  ○ Visible: make the results public and hard to miss (reply on the mailing list)
  ○ Stable: the noise level should be zero (be aggressive at blacklisting unstable tests)


Intel GFX CI


Intel GFX CI - https://intel-gfx-ci.01.org


Current state 


● Provide timely, public, stable and transparent results for:
  ○ Trees:
    ■ Pre-merge: DRM-tip, IGT
    ■ Post-merge: DRM-tip, Linus’ tree, Linux-next, *-fixes, drm-internal
  ○ Machines (40 systems in total, 19 different platforms, Gen3 to current):
    ■ GDG (Gen3, 2004) -> GLK (not released yet)
    ■ Sharded machines: 6 KBL, 6 HSW, 6 SNB, 8 APL
    ■ SKL Xeon
    ■ GVT-d BDW and SKL (Virtualization)
  ○ Test suites:
    ■ IGT:
      ● Fast-feedback: 279 tests, run on all machines
      ● Full KMS + some GEM tests: ~2500 tests, run on sharded machines
  ○ Throughput:
    ■ From 22k tests/day (Aug 2016) to 400k+ tests/day (Aug 2017) (see next slide)
    ■ Bug filing: usually under 1h during working hours


CI throughput per day (from 08/2016 until today)


[Chart: number of tests executed per day on CI (max from 2 days per month; = # of machines x number of tests x number of runs). Series: Max (labelled value: 210168), # of machines, Linear (Max). X-axis: month, starting 9/2016.]


Intel-GFX CI: Roadmap


Plans 
● Provide timely, visible, stable and transparent results for:
● Machines:
  ○ Keep adding new platforms / hardware configurations
  ○ More display types (including Chamelium)
● Test suites:
  ○ Full IGT on all machines. Requires:
    ■ Developers to improve IGT so that it runs in < 6 hours (kms, gem, prime)
    ■ Squashing all patch series into one tree
    ■ Auto-bisecting issues to the offending patch series
  ○ Performance and rendering. Requires:
    ■ EzBench support
    ■ Better prioritization of tasks for machine time
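
Squashing all patch series into one tree only works if a failure can be traced back to the series that caused it. A minimal sketch of such an auto-bisect is given here; it is not the Intel GFX CI implementation. The `passes` callback is hypothetical: it stands for "build a tree with these series applied, in order, and run the tests". The sketch assumes the baseline tree passes, failures are deterministic, and a broken tree stays broken as more series are applied.

```python
from typing import Callable, Optional, Sequence


def first_bad_series(
    series: Sequence[str],
    passes: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Return the first series whose addition breaks the tree, or None.

    `passes(prefix)` is a hypothetical callback: it builds and tests a tree
    with the series in `prefix` applied in order, and returns True on success.
    """
    if not series or passes(series):
        return None  # nothing applied, or the fully squashed tree is fine

    # Invariant: the first `lo` series pass, the first `hi` series fail.
    lo, hi = 0, len(series)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(series[:mid]):
            lo = mid
        else:
            hi = mid
    return series[hi - 1]
```

With N squashed series this needs about log2(N) extra build-and-test runs, instead of one run per series.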




Contacts 


Tomi Sarvela 

● Infrastructure and most of the automation software
Martin Peres
● EzBench, CI bug log, bug filing
Arkadiusz Hiler
● IGT maintainer, backup for Tomi, pre-silicon CI
Petri Latvala
● IGT maintainer, EzBench support


Lessons learnt


Key findings to replicate our system


● What is not tested continuously is broken
● Bugzilla is not a good tool to track test failures
● Noise is enemy #1:
  ○ Treat every failure as a bug
  ○ Run tests in a loop
  ○ Collect failure statistics and history!
● Make sure developers own the CI system
  ○ The CI team works for developers
  ○ Developers suggest improvements to the system and improve the test suites
● Have automated metrics for everything!
● It took us a year to get the basic IGT testing stable on 2004+ hardware
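
"Run tests in a loop" and "collect failure statistics" can be sketched as follows. This is an illustration only, not the tooling used by the Intel GFX CI: tests are modelled as hypothetical Python callables that return True on success, and the function names are made up for this example.

```python
from collections import Counter
from typing import Callable, Dict, List


def failure_rates(tests: Dict[str, Callable[[], bool]], runs: int = 100) -> Dict[str, float]:
    """Run every test `runs` times and return its failure rate (0.0 to 1.0)."""
    failures: Counter = Counter()
    for _ in range(runs):
        for name, test in tests.items():
            if not test():
                failures[name] += 1
    return {name: failures[name] / runs for name in tests}


def unstable_tests(rates: Dict[str, float]) -> List[str]:
    """Tests that sometimes pass and sometimes fail: the source of noise."""
    return sorted(name for name, rate in rates.items() if 0.0 < rate < 1.0)
```

Tests returned by `unstable_tests` are candidates for a bug report and, until fixed, for blacklisting.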

What is needed for HW CI


● Requirements for making a useful CI system:
  ○ Infrastructure:
    ■ Physical space
    ■ Enough power and cooling
    ■ Power cutters for all machines
    ■ Reliable network (the simpler the better)
  ○ Hardware:
    ■ Machines with different configurations (chipsets, RAM, connectors, screens)
    ■ Ways to resume the machine (RTC wake, ...)
  ○ Software:
    ■ Scheduling jobs (Jenkins, ...)
    ■ Graphics stack compilation automation
    ■ Automatic deployment and reboot
    ■ External watchdog
  ○ Humans:
    ■ Qualified engineers to file bugs
    ■ Developers to act quickly on bug reports
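
The external watchdog and the power cutters work together: a controller outside the machines under test notices when one stops responding and cuts its power. The sketch below shows only the bookkeeping side and is an assumption about how this could look, not a description of the actual system. The class name, the heartbeat mechanism and the machine names are hypothetical.

```python
import time
from typing import Callable, Dict, List


class ExternalWatchdog:
    """Tracks heartbeats from test machines; runs on a separate controller."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, machine: str) -> None:
        """Record that `machine` reported progress just now."""
        self.last_seen[machine] = self.clock()

    def expired(self) -> List[str]:
        """Machines silent for longer than the timeout: candidates for a power cut."""
        now = self.clock()
        return sorted(m for m, t in self.last_seen.items() if now - t > self.timeout_s)
```

A caller would poll `expired()` periodically and trigger the power cutter for each machine it returns.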

Challenges of doing kernel CI


● Booting garbage kernels:
  ○ Boot, network, and/or filesystem may be broken
● Getting traces out, especially during suspend/resume:
  ○ Kernel parameters: use “nmi_watchdog=panic,auto panic=1 softdog.soft_panic=1”
  ○ Use pstore for EFI-capable HW, serial consoles for others
● Dealing with memory corruptions:
  ○ They will trash your partitions
  ○ You need an automated script to re-deploy machines
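
A deployment script can verify that a machine really booted with the kernel parameters listed above. The helper below is a small sketch with a made-up name (`missing_params`); only the parameter strings come from the slide.

```python
from typing import List, Sequence

# Parameters recommended on the slide for getting traces out of a dying kernel.
REQUIRED_PARAMS = ("nmi_watchdog=panic,auto", "panic=1", "softdog.soft_panic=1")


def missing_params(cmdline: str, required: Sequence[str] = REQUIRED_PARAMS) -> List[str]:
    """Return the required parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]
```

On a Linux machine the command line of the running kernel can be read from /proc/cmdline and passed to this function; an empty list means all three parameters are set.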

CI Bootstrapping


● Step 0: Gather hardware and test suites
● Step 1: Run the test suites automatically on this hardware
● Step 2: Report failures to a tool that checks whether each failure is known
● Step 3: File bugs about unknown failures
● Step 4: When no new failures happen for some time, add to pre-merge
● Step 5: Go to step 0
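
Steps 2 to 4 can be sketched as below. This is one possible shape, under stated assumptions: a failure is identified by a (machine, test) pair, "some time" is modelled as a number of consecutive runs without new failures, and all names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Failure = Tuple[str, str]  # (machine, test name)


def triage(failures: Iterable[Failure], known: Set[Failure]) -> List[Failure]:
    """Step 2: return the failures not seen before.

    The caller files a bug for each returned failure (step 3); the failures
    are recorded in `known` so that they are not reported twice.
    """
    new = sorted(set(failures) - known)
    known.update(new)
    return new


def ready_for_premerge(new_failures_per_run: List[int], quiet_runs: int = 10) -> bool:
    """Step 4: True once the last `quiet_runs` runs produced no new failure."""
    recent = new_failures_per_run[-quiet_runs:]
    return len(recent) == quiet_runs and all(count == 0 for count in recent)
```

The value of `quiet_runs` is a policy choice: a larger value delays pre-merge testing but lowers the risk of exposing developers to noise.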

Conclusion


CI is good!


Join us, and let's collaborate!


Questions / discussion 

